{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# {fas}`book;pst-color-primary` {daobook}`xgrammar`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "{daobook}`XGrammar <xgrammar>` {cite:p}`dong2024xgrammarflexibleefficientstructured` 开源解决方案，旨在实现灵活、便携且快速的结构化生成。该项目的使命是将灵活的无开销结构化生成带到每个角落。\n",
    "\n",
    "XGrammar 是由陈天奇团队推出的开源软件库，能为⼤型语⾔模型（LLM）提供⾼效、灵活且可移植的结构化数据⽣成能⼒。基于上下⽂⽆关语法（CFG）定义结构，⽀持递归组合以表⽰复杂结构，适合⽣成JSON、SQL等格式数据。XGrammar ⽤字节级下推⾃动机优化解释 CFG，减少每 token 延迟，实现百倍加速，⼏乎⽆额外开销。XGrammar 集成多种系统优化，如⾃适应 token 掩码缓存、上下⽂扩展等，提⾼掩码⽣成速度并减少预处理时间。XGrammar 的 C++ 后端设计易于集成，并⽀持在 LLM 推理中实现零开销的结构化⽣成。\n",
    "\n",
    "```{admonition} XGrammar 的主要功能\n",
    "- ⾼效结构化⽣成：⽀持上下⽂⽆关语法（CFG），⽀持定义和⽣成遵循特定格式（如JSON、SQL）的结构化数据。\n",
    "- 灵活性：基于CFG的递归规则，能灵活地表⽰复杂的结构，适应多样的结构化数据需求。\n",
    "- 零开销集成：XGrammar与LLM推理引擎共同设计，能在LLM推理中实现零开销的结构化⽣成。\n",
    "- 快速执⾏：基于系统优化，显著提⾼结构化⽣成的执⾏速度，相⽐于SOTA⽅法，每token延迟减少多达100倍。\n",
    "- 跨平台部署：具有最⼩且可移植的C++后端，能轻松集成到多个环境和框架中。\n",
    "- ⾃适应token掩码缓存：在预处理阶段⽣成，加快运⾏时的掩码⽣成。\n",
    "```\n",
    "```{admonition} XGrammar 的技术原理\n",
    "- 字节级下推⾃动机（PDA）：⽤字节级PDA解释CFG，⽀持每个字符边缘包含⼀个或多个字节，处理不规则的token边界，⽀持包含sub-UTF8字符的token。\n",
    "- 预处理和运⾏时优化：在预处理阶段，⽣成⾃适应token掩码缓存，基于预先计算与上下⽂⽆关的token加快运⾏时的掩码⽣成。\n",
    "- 上下⽂⽆关与相关token的区分：区分上下⽂⽆关token和上下⽂相关token，预先计算PDA中每个位置的上下⽂⽆关token的有效性，并将它们存储在⾃适应token掩码缓存中。\n",
    "- 语法编译：基于语法编译过程，预先计算掩码中相当⼀部分token，加快掩码⽣成速度。\n",
    "- 算法和系统优化：包括上下⽂扩展、持续性执⾏堆栈、下推⾃动机结构优化等，进⼀步提⾼掩码⽣成速度并减少预处理时间。\n",
    "- 掩码⽣成与LLM推理重叠：将CPU上的掩码⽣成过程与GPU上的LLM推理过程并⾏化，消除约束解码的开销\n",
    "```\n",
    "```{admonition} XGrammar 的应⽤场景\n",
    "- 编程语⾔辅助：⽤于辅助编写和调试代码，⾃动⽣成符合特定编程语⾔规范的代码⽚段，提⾼开发效率。\n",
    "- 数据库操作：⽣成符合SQL语法的查询语句，帮助开发者或应⽤程序⾃动构建数据库查询，减少⼿动编写SQL语句的⼯作量。\n",
    "- ⾃然语⾔处理（NLP）：⽣成结构化的训练数据，⽤于训练和优化NLP模型，提⾼模型对结构化信息的处理能⼒。\n",
    "- Web开发：⾃动⽣成前端代码和API⽂档，确保⽂档与代码的⼀致性，提⾼开发效率和维护性。\n",
    "- 配置⽂件和模板：⽣成和填充配置⽂件及模板，如⾃动化⽣成系统配置、填充邮件模板等，提⾼⾃动化⽔平。\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from set_env import temp_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 安装\n",
    "\n",
    "`pip` 安装\n",
    "```bash\n",
    "pip install xgrammar\n",
    "```\n",
    "\n",
    "当使用 NVIDIA GPU 时，请同时安装这些额外的依赖项以启用 CUDA 支持来应用位掩码：\n",
    "```bash\n",
    "pip install cuda-python nvidia-cuda-nvrtc-cu12\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 测试"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下载模型：\n",
    "```bash\n",
    "git clone https://www.modelscope.cn/LLM-Research/Llama-3.2-1B-Instruct.git {temp_dir}/LLM-Research/Llama-3.2-1B-Instruct\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "实例化模型、分词器和输入："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-01-06 14:55:49.127058: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
      "E0000 00:00:1736146549.148811 3636295 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "E0000 00:00:1736146549.155424 3636295 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
      "2025-01-06 14:55:49.178656: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
      "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig\n",
    "\n",
    "device = \"cuda\"  # Or \"cpu\", etc.\n",
    "model_name = f\"{temp_dir}/LLM-Research/Llama-3.2-1B-Instruct\"\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_name, torch_dtype=torch.float32, device_map=device\n",
    ")\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "config = AutoConfig.from_pretrained(model_name)\n",
    "\n",
    "messages = [\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
    "    {\"role\": \"user\", \"content\": \"Introduce yourself in JSON briefly.\"},\n",
    "]\n",
    "texts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
    "model_inputs = tokenizer(texts, return_tensors=\"pt\").to(model.device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 编译语法\n",
    "\n",
    "构建 `GrammarCompiler` 并编译语法。\n",
    "\n",
    "语法可以是内置的 JSON 语法、JSON 模式字符串或EBNF字符串。EBNF 提供了更高的自定义灵活性。有关规范，请参阅 [GBNF 文档](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xgrammar as xgr\n",
    "tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=config.vocab_size)\n",
    "grammar_compiler = xgr.GrammarCompiler(tokenizer_info)\n",
    "compiled_grammar = grammar_compiler.compile_builtin_json_grammar()\n",
    "# Other ways: provide a json schema string\n",
    "# compiled_grammar = grammar_compiler.compile_json_schema(json_schema_string)\n",
    "# Or provide an EBNF string\n",
    "# compiled_grammar = grammar_compiler.compile_grammar(ebnf_string)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用语法生成\n",
    "\n",
    "使用 `LogitsProcessor` 结合语法进行生成。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"name\": \"Assistant\",\n",
      "  \"id\": \"AI-Assistant\",\n",
      "  \"description\": \"A helpful assistant designed to assist with a wide range of tasks and questions\",\n",
      "  \"image\": \"https://example.com/assistant-icon.png\",\n",
      "  \"skills\": [\"Conversational AI\", \"Text-based interaction\", \"Language understanding\"]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "xgr_logits_processor = xgr.contrib.hf.LogitsProcessor(compiled_grammar)\n",
    "generated_ids = model.generate(\n",
    "    **model_inputs, max_new_tokens=512, logits_processor=[xgr_logits_processor]\n",
    ")\n",
    "generated_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]\n",
    "print(tokenizer.decode(generated_ids, skip_special_tokens=True))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ai",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
